Toward Off-Policy Learning Control with Function Approximation

Authors

  • Hamid Reza Maei
  • Csaba Szepesvári
  • Shalabh Bhatnagar
  • Richard S. Sutton
Abstract

We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.

In reinforcement learning, the term "off-policy learning" refers to learning about one way of behaving, called the target policy, from data generated by another way of selecting actions, called the behavior policy. The target policy is often an approximation to the optimal policy, which is typically deterministic, whereas the behavior policy is often stochastic, exploring all possible actions in each state as part of finding the optimal policy. Freeing the behavior policy from the target policy enables a greater variety of exploration strategies to be used. It also enables learning from training data generated by unrelated controllers, including manual human control, and from previously collected data. A third reason for interest in off-policy learning is that it permits learning about multiple target policies (e.g., optimal policies for multiple subgoals) from a single stream of data generated by a single behavior policy.

Off-policy learning for tabular (non-approximate) settings is well understood; there exist simple, online algorithms such as Q-learning (Watkins & Dayan, 1992) that converge to the optimal target policy under minimal conditions. For approximation settings, however, the results are much weaker. One promising recent development is gradient-based temporal-difference (TD) learning methods, which have been proven stable under off-policy learning for linear (Sutton et al., 2009a) and nonlinear (Maei et al., 2010) function approximators. However, so far this work has applied only to prediction settings, in which both the target and the behavior policy are stationary.

In this paper we generalize prior work with gradient TD methods by allowing changes in the target policy. In particular, we consider learning an approximation to the optimal action-value function (thereby finding an approximately optimal target policy) from data generated by an arbitrary stationary behavior policy. We call this problem setting latent learning because the optimal policy is learned but remains latent; it is not allowed to be overtly expressed in behavior. Our latent learning result could be extended further, for example to allow the behavior policy to change slowly as long as it remained sufficiently exploratory, but it is already a significant step. Our results build on ideas from prior work with gradient TD methods but require substantially different techniques to deal with the control case.
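As a point of reference for the tabular case mentioned above, here is a minimal sketch of off-policy tabular Q-learning, in which a stochastic behavior policy generates the data while the target policy is greedy with respect to the learned Q table. The environment interface (env.reset / env.step), the uniformly random behavior policy, and the step-size and discount values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tabular_q_learning(env, num_states, num_actions, episodes=500,
                       alpha=0.1, gamma=0.99, seed=0):
    """Tabular Q-learning: the behavior policy is uniformly random,
    while the target policy is greedy w.r.t. the learned Q table."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s, done = env.reset(), False            # assumed environment interface
        while not done:
            a = rng.integers(num_actions)       # behavior policy: explore uniformly
            s_next, r, done = env.step(a)       # assumed environment interface
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])   # off-policy update toward the greedy target
            s = s_next
    greedy_policy = Q.argmax(axis=1)            # the target policy is never executed above
    return Q, greedy_policy
```

With a linear function approximator in place of the table, this same off-policy update is exactly the setting in which Q-learning can become unstable, which is what motivates the gradient-TD approach developed in the paper.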
We present a new latent learning algorithm, Greedy-GQ, which possesses a number of properties that we find desirable: 1) linear function approximation; 2) no restriction on the features used; 3) online and incremental operation, with memory and per-time-step computation costs that are linear in the number of features; and 4) convergence to a local optimum or equilibrium point.

Alternative ways of solving the latent learning problem include using non-incremental methods that are more computationally expensive (e.g., Lagoudakis & Parr, 2003; Antos et al., 2008; 2007), possibly with nonlinear value function approximation methods (e.g., Antos et al., 2008; 2007); putting restrictions on the linear function approximation method (Gordon, 1995; Szepesvári & Smart, 2004); or putting restrictions on the interaction of the sample and the features (Melo et al., 2008). Non-incremental methods that allow nonlinear value function approximation are an interesting alternative. Because they are non-incremental, no stability issues arise; the price is that their computational complexity is harder to control. For a discussion of the relative merits of (non-)incremental methods, the reader is referred to Section 2.2.3 of Szepesvári (2009). Previous theoretical attempts to construct incremental methods with the above properties include those of Szepesvári & Smart (2004) and Melo et al. (2008), which also discuss relevant prior literature. The first of these works suggests using interpolative function approximation techniques (restricting the features); the second proves convergence only when the sample distribution and the features are matched in some sense. Both works prove convergence to a fixed point of a suitably defined operator. In contrast, our algorithm is not restricted in its choice of features. However, we are able to prove only convergence to the equilibria of a suitably defined cost function. The cost function that our algorithm attempts to minimize is the projected Bellman error (Sutton et al., 2009a), which we extend to the control setting in this paper.

1. The learning problem

We assume that the reader is familiar with the basic concepts of MDPs (for a refresher, we refer the reader to Sutton & Barto, 1998). The purpose of this section is to define the learning problem and our notation. We consider the following latent learning scenario: an agent interacts with its environment, and the interaction results in a sequence S0, A0, R1, S1, A1, ... of random variables, where for t ≥ 0, St ∈ S are states, At ∈ A are actions, and Rt+1 ∈ ℝ are rewards. Fix t ≥ 0 and let Ht = (S0, A0, R1, ..., St) be the history up to time t. It is assumed that a fixed behavior policy πb is used to generate the actions: At ∼ πb(·|St), independently of the history Ht given St. Thus, for any s ∈ S, πb(·|s) is a probability distribution over A. It is also assumed that (St+1, Rt+1) ∼ P(·, ·|St, At), independently of Ht given St, At. Here P is the joint next-state and reward distribution kernel. For simplicity, we assume that (St, At) is in its steady state, and we use μ to denote the underlying distribution. The goal of the agent is to learn an optimal policy for the MDP M = (S, A, P) with respect to the total expected discounted reward criterion. The optimal action-value function under this criterion shall be denoted by Q∗. (To avoid measurability issues, assume that S and A are at most countably infinite; however, the results extend to more general spaces under some additional assumptions.)
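To make the latent learning scenario above concrete, the following sketch generates the stream S0, A0, R1, S1, A1, ... from a small finite MDP under a fixed stochastic behavior policy πb. The particular transition kernel, rewards, and behavior policy are randomly generated illustrative assumptions; only the interface (actions drawn from πb, next state and reward drawn from P) mirrors the setup described here.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny finite MDP (illustrative): 3 states, 2 actions.
# P[s, a, s'] is the next-state probability, R[s, a, s'] the expected reward.
num_states, num_actions = 3, 2
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
R = rng.normal(size=(num_states, num_actions, num_states))

# Fixed behavior policy pi_b(a | s): stochastic, explores every action in every state.
pi_b = rng.dirichlet(np.ones(num_actions), size=num_states)

def generate_stream(s0, steps):
    """Yield (S_t, A_t, R_{t+1}, S_{t+1}); actions always come from pi_b,
    so any target policy learned from this data stays latent."""
    s = s0
    for _ in range(steps):
        a = rng.choice(num_actions, p=pi_b[s])
        s_next = rng.choice(num_states, p=P[s, a])
        r = R[s, a, s_next]
        yield s, a, r, s_next
        s = s_next

for transition in generate_stream(s0=0, steps=5):
    print(transition)
```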
As is well known, acting greedily with respect to Q∗ leads to an optimal policy. Recall that a policy π is greedy w.r.t. an action-value function Q if, for every state s, π selects an action (possibly at random) from among the maximizers of Q(s, ·). The Bellman operator underlying a stationary policy π, acting on action-value functions, shall be denoted by T^π and is defined by

T^π Q(s, a) = ∫ {r(s, a, s′) + γQ(s′, b)} π(db|s′) PS(ds′|s, a),

where r(s, a, s′) is the expected immediate reward of the transition (s, a, s′), PS(·|s, a) is the next-state distribution (the marginal of P(·, ·|s, a)), and we slightly abuse notation by using integral signs to denote both sums and integrals, depending on whether the respective spaces are discrete or continuous.

2. Derivation of Greedy-GQ

The purpose of this section is to derive the new algorithm. We use linear value function approximation of the form Qθ(s, a) = θ⊤φ(s, a), (s, a) ∈ S × A, to approximate Q∗. Here φ(s, a) ∈ ℝ^d are the features and θ ∈ ℝ^d are the parameters to be tuned. We also employ a class of stationary policies, (πθ; θ ∈ ℝ^d). For each θ ∈ ℝ^d, πθ is a stationary policy (possibly stochastic). We will use πθ(·|s) to denote the action-selection probability distribution chosen by πθ at state s. Two choices of particular interest are the greedy class and the (truncated) Gibbs class: for the greedy class, for any θ ∈ ℝ^d, πθ(·|s) is a greedy policy w.r.t. Qθ; for the Gibbs class, the set A is assumed to be countable and πθ(a|s) ∝ exp(κ(Qθ(s, a))), where (e.g.) κ(x) = c/(1 + exp(−x)) with some c > 0. The main idea of the algorithm is to minimize the projected Bellman error
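The two policy classes just described can be written down directly on top of the linear approximation Qθ(s, a) = θ⊤φ(s, a). The sketch below is only an illustration: the feature map φ, the dimension d, the action set, and the constant c in κ are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

def phi(s, a):
    """Illustrative feature map phi(s, a) in R^d (an assumption for this sketch)."""
    return np.array([1.0, s, a, s * a], dtype=float)

def q_theta(theta, s, actions):
    """Linear action values Q_theta(s, a) = theta^T phi(s, a), one per action."""
    return np.array([theta @ phi(s, a) for a in actions])

def greedy_policy(theta, s, actions):
    """Greedy class: put all probability on a maximizer of Q_theta(s, .)."""
    q = q_theta(theta, s, actions)
    probs = np.zeros(len(actions))
    probs[np.argmax(q)] = 1.0
    return probs

def gibbs_policy(theta, s, actions, c=1.0):
    """(Truncated) Gibbs class: pi_theta(a|s) proportional to exp(kappa(Q_theta(s, a))),
    with the bounded squashing function kappa(x) = c / (1 + exp(-x))."""
    q = q_theta(theta, s, actions)
    kappa = c / (1.0 + np.exp(-q))
    w = np.exp(kappa)
    return w / w.sum()

# Example usage with d = 4, two actions, and theta initialized to zero.
theta = np.zeros(4)
print(greedy_policy(theta, s=2.0, actions=[0, 1]))
print(gibbs_policy(theta, s=2.0, actions=[0, 1]))
```

Note that κ is bounded, so the exponent in the Gibbs class stays bounded and πθ varies smoothly with θ, in contrast to the greedy class, which changes abruptly as the maximizing action changes.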


Similar Articles

Using Policy Gradients to Account for Changes in Behaviour Policies under Off-policy Control

Off-policy learning refers to the problem of learning the value function of a behaviour, or policy, while selecting actions with a different policy. Gradient-based off-policy learning algorithms, such as GTD (Sutton et al., 2009b) and TDC/GQ (Sutton et al., 2009a), converge when selecting actions with a fixed policy even when using function approximation and incremental updates. In control prob...


Weighted importance sampling for off-policy learning with linear function approximation

Abstract Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, weighted importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. First, we show tha...


Q(λ) with Off-Policy Corrections

We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided ...


University of Alberta Gradient Temporal-Difference Learning Algorithms

We present a new family of gradient temporal-difference (TD) learning methods with function approximation whose complexity, both in terms of memory and per-time-step computation, scales linearly with the number of learning parameters. TD methods are powerful prediction techniques, and with function approximation form a core part of modern reinforcement learning (RL). However, the most popular T...


Adaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning

Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a policy that is different from the currently optimized policy. A common approach is to use importance sampling techniques for compensating for the bias of value function estimators caused by the difference between the data-sampling policy and the target policy. However, existing off-policy methods often ...


Publication year: 2010